Tools for Efficient Workflows
Part 1: Automatable reports

Lukas Lehner and Maximilian Trenkmann
with thanks to Julia Schulte-Cloos

2023-01-12

👋🇼‌🇪‌🇱‌🇨‌🇴‌🇲‌🇪‌🙋

Our plan

  1. Automatable reports

  2. Version control

  3. Dissemination and academic websites

  4. Containerisation for reproducible environments

  5. Encryption and advanced programming

Session 1: Automatable reports

Part I: Reproducible Research

Why should you care about reproducible research?

🙌 It benefits you! 🙌

‘Create a better relationship with your future self’

Why should you care about reproducible research?

🚀 That’s the future of the social sciences! 🚀

Reproducibility vs. replication? 🤔

Replicability refers to situations in which a researcher obtains new data to reach the same scientific conclusions as a previous study, whereas reproducibility refers to situations in which the original researcher’s software, code, and data are used to regenerate the results.

Replication standards: guidelines, protocols, and software designed to help researchers share, analyze, archive, preserve, distribute, catalog, translate, verify, and replicate scholarly research data and analyses across disciplines. Includes proposals to improve the norms around data sharing and replication in scientific research.

What hinders reproducible research and what can facilitate it?


Obstacles 🚧

  • Infrastructure and research habits
  • Hardware requirements
  • Operating systems
  • Versions of software and libraries

Solutions ✨

  • Optimised workflows (integrating coding, authoring, version control)
  • Virtual machines for computationally intense analyses
  • Containerisation

Why Open Data?

Efficiency 🏇

Science is not built on blind trust but on verifiability: science as “organized skepticism” (Merton, 1947). Only when raw data and other research material are shared can such organized skepticism be put into practice and science correct itself. Open Data is therefore one aspect of good scientific practice.

Data persistence 👴

Reliable infrastructure for storage and publication (e.g., subject-specific repositories, institutional repositories)

Funding requirements 👮

Plan S principle: “from 2021, scientific publications that result from research funded by public grants must be published in compliant Open Access journals or platforms.” (Sherpa Romeo database; fairsharing.org)

Part II: Literate Programming

Literate Programming

Communication via code 🗣️💻

Integrate computer code with software documentation in a single document

Minimal requirements of high-quality code? 👆

  • executes what it is supposed to execute
  • runs without defects or problems, and not only under some circumstances
  • easy to read, maintain, and extend

Good practices 😇

  • directory structure
  • relative paths: read.csv('./data/foo.csv')
  • compile documents in clean software sessions
  • do not set a working directory (or only globally, at the very beginning of a script) → documents should be self-contained and portable
  • attach session information (output of sessionInfo())

How to design a well structured project directory?

  • use a naming convention that is…
    • human readable: directory names that are easy to understand for you & someone not familiar with the naming convention
    • machine readable: avoid spaces
    • sortable: names sort the list of input files in a meaningful order
  • directory names that contain components of the project and can be referenced in the code (e.g. figs, data, etc.)
- ./data
    + raw_data.csv
    + tidy_data.csv
    + codebook.txt
- ./analysis
- ./figures
    + interaction_plot.png
    + bar_plot.png
- ./paper
- ./presentation
- ./README.md
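
A directory layout like the one above can be set up from the command line. The following is only a minimal sketch: the project name my_project is a hypothetical placeholder, and the placeholder files are created empty.

```shell
# Hypothetical project name; replace it with your own.
project="my_project"

# Create all top-level directories in one call (-p: no error if they already exist).
mkdir -p "$project"/data "$project"/analysis "$project"/figures \
         "$project"/paper "$project"/presentation

# Add empty placeholder files so the structure is visible from the start.
touch "$project"/README.md "$project"/data/codebook.txt

# List the result, one entry per line.
ls -1 "$project"
```

Creating the skeleton with a script rather than by hand keeps the naming convention identical across projects.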

Variable naming conventions

  • snake_case
  • camelCase

Command Line

  • Safety is important!
  • Simple commands can do a lot of damage, e.g. rm -rf / (do not run this!)

arrow keys

  • Up and Down cycle through previously executed commands

cd

  • Change Directory
  • Relative: cd ..\documents
  • Absolute: cd C:\ProgramData

rm

  • Remove a file: rm file.md
  • Remove a directory and its contents: rm -r ./directory

mv

  • Move or rename
  • mv ./oldfilename.md ./newfilename.md

cat

  • Show contents of file
  • cat file.md

ls

  • List directory contents
  • ls

pwd

  • Show current directory
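
The commands above can be tried out safely in a throwaway directory. This is a sketch for a Unix-like shell (it assumes mktemp is available, and the variable names scratch and start_dir are made up for illustration); nothing outside the temporary directory is touched.

```shell
# Work inside a throwaway directory so that mistakes cannot touch real files.
scratch="$(mktemp -d)"
start_dir="$(pwd)"
cd "$scratch"

pwd                                    # show the current directory
printf 'hello\n' > oldfilename.md      # create a small file to play with
mv ./oldfilename.md ./newfilename.md   # rename (move) it
contents="$(cat newfilename.md)"       # read its contents
echo "$contents"
mkdir directory                        # create a directory ...
ls                                     # ... and list what is there now

rm newfilename.md                      # remove a single file
rm -r ./directory                      # remove a directory recursively

cd "$start_dir"                        # go back to where we started
rmdir "$scratch"                       # succeeds only if the scratch directory is empty
```

Using rmdir instead of rm -r for the final clean-up is a deliberate choice: it refuses to delete anything that still has content.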

Additional Tips

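  • >> (append output to a file), && (run the next command only if the previous one succeeded), ; (run commands one after another), | (pipe the output of one command into the next)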
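
A short sketch of the four operators in a Unix-like shell. The variable names log, status and line_count are made up for illustration, and the log file is temporary.

```shell
log="$(mktemp)"          # temporary file, so nothing real is overwritten

# '>>' appends to a file instead of overwriting it.
echo "first line"  >> "$log"
echo "second line" >> "$log"

# '&&' runs the second command only if the first one succeeded.
[ -s "$log" ] && status="log has content"

# ';' simply runs commands one after another.
echo "step one" ; echo "step two"

# '|' pipes the output of one command into the next one.
line_count="$(cat "$log" | wc -l | tr -d ' ')"
echo "$status, $line_count lines"
```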

Windows Powershell

  • When to run as administrator
    • To install global software
  • copy & paste
    • Right Click
    • CTRL+C & CTRL+V depends on environment
  • abort a command
    • CTRL+C

Terminal

  • When to use sudo
    • To install global software
  • copy & paste
    • CTRL+C & CTRL+V
    • CTRL+Shift+C & CTRL+Shift+V
  • abort a command
    • CTRL+C: kill a process
    • CTRL+Z: pause a process

Programming Paradigms

  • Object Oriented: Java, C#
  • Procedural: C, Pascal
  • Multi Paradigm: Python, R
  • Functional: Clojure, Haskell
  • Reactive: React, Vue

Executable Reports

Getting started: Markdown, RMarkdown, and Quarto I

  • Markdown as a human readable way to style text

  • “Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).” John Gruber, founder of Markdown

  • R and RStudio (RStudio is not the only IDE that supports RMarkdown; Visual Studio Code is also a great choice)

  • RMarkdown integrates R code into Markdown language through knitr

  • Quarto: extension of RMarkdown, optimised for language interoperability & CLI

Getting started: Markdown, RMarkdown, and Quarto II

Getting started: Pandoc and Lua I

  • Pandoc:
    • extremely powerful open-source document conversion tool
    • allows for conversion between different (40+) markup languages
    • conversion e.g., between docx, HTML, LaTeX, and Markdown
  • Lua filters:
    • manipulate the Pandoc Abstract Syntax Tree (AST) between the parsing and the writing phase
    • powerful collection of Pandoc Lua filters available open-source
    • extremely useful to adjust the standard RMarkdown framework for scientific use cases (e.g., blinded version of manuscripts, several bibliographies, etc.)
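
A minimal sketch of a Pandoc conversion from the command line. The file names notes.md and notes.html are hypothetical, and the conversion step only runs if pandoc is installed.

```shell
# Write a tiny Markdown file to convert (hypothetical file name).
printf '# Title\n\nSome **bold** text.\n' > notes.md

# Convert only if pandoc is installed; the output format is inferred
# from the file extension given to -o.
if command -v pandoc >/dev/null 2>&1; then
  pandoc notes.md -o notes.html
  # A Lua filter would be passed with --lua-filter, e.g.
  #   pandoc notes.md --lua-filter=my-filter.lua -o notes.html
  # (my-filter.lua is a hypothetical file, not shipped with Pandoc)
else
  echo "pandoc not found; skipping conversion"
fi
```

Changing the extension after -o (for example to .docx) is enough to switch the target format.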

Getting started: Pandoc and Lua II

Markdown basics I

Text formatting and emphasis

  • bold text can be created with **bold text** or equivalently __bold__
  • italic text can be created with *italic* or equivalently _italic_

Markdown basics II

Sections

  • # A level-one section

  • ## A level-two section with a [link](/url)

  • # An unnumbered section {-}, or equivalently # An unnumbered section {.unnumbered}


  • always include blank line before a header
  • sections can be labelled and referenced by including an attribute after the header: {#sec:introduction}
  • if you do not specify a section id, Pandoc will automatically create one, e.g. # Reproducible research outputs {#reproducible-research-outputs}.

Markdown basics III

Lists

Bullet list

  • Bullet 1
  • Bullet 2
    • Sub-bullet 1
    • Sub-bullet 2

Numbered lists

  1. Point 1
  2. Point 2
    2.1. First sub-point
    2.2. Second sub-point

Markdown basics IV

Footnotes

Writing in “source” vs. “visual” mode

…mostly a matter of taste 🍷🍺

Narrative text & code integration I

Code chunks

Control how code and its output appear in your compiled report or manuscript. Chunk labels are optional, but every label must be unique within a document, e.g. {r data2017-tidy}


Chunk options

Define conditions under which the code is evaluated and how its output is processed within the document. Most frequent options include: eval, include, results, echo. Comprehensive list online, in the RMarkdown reference guide, and for Quarto. Most IDEs allow you to easily switch between different chunks.

Narrative text & code integration II

→ old-school way to specify chunk options

```{r elephant-chunk-1, out.width="20%", fig.align="center", fig.cap="Elephant in the room", echo="fenced"}
knitr::include_graphics(path = "figs/elephant.jpg")
```

[Figure: Elephant in the room]

→ more recently, chunk options can be specified as comments within the actual code chunks to increase readability

```{r elephant-chunk-2}
#| out.width: '20%'
#| fig.align: 'center'
#| fig.cap: 'Elephant in the room'
#| fig.alt: 'A pink elephant portrayed in a room'
knitr::include_graphics(path = "figs/elephant.jpg")
```

[Figure: Elephant in the room]

Narrative text & code integration III

Referencing actual results

# A simple linear regression model
fit <- lm(dist ~ speed, data = cars)
The slope of the regression is
`r round(fit$coefficients[2], digits = 2)`.

The slope of the regression is 3.93

YAML header

  • sets global parameters of document
  • E.g. output, title, author, date
  • YAML (“YAML Ain’t Markup Language”) is a data serialisation syntax
  • key-value pairs separated by colons
  • indentation is critical!
---
title: "Writing a reproducible research paper"
author: "Julia Schulte-Cloos"
date: 2023-12-01
output:
  bookdown::pdf_document2:
    keep_tex: yes
    number_sections: false
    toc: false
documentclass: scrartcl
---

In doubt about YAML validity? Use an available YAML linter.

Parameterized reports

You can render your document with parameters that are specified globally in the YAML header and that affect how your code is evaluated, e.g. by focusing only on a subset of your data. Within the document, the values are available as params$alpha and params$ratio.

---
title: "My Document"
params:
  alpha: 0.1
  ratio: 0.1
---

Lab Session I 👩‍💻 👨‍💻

Create a reproducible document that…

  • includes a title and your authoring information
  • features a footnote and an image
  • features some real literate programming (e.g., printing some calculation within the written text)
  • can be knitted both to HTML and PDF

🎁 Bonus

Turn your Rmd file into a Quarto document (…or the other way around!) What are some of the differences?


Part III: Authoring with markdown and knitr

Layout

Table of contents

  • inclusion after title page
  • add parameter in YAML header
output:
  pdf_document:
    toc: true

Paragraphs and indentation

  • Pandoc option indent: true in the YAML header

Page margins and spacing

  • geometry option in the YAML header

Bibliography and citations

Bib-files and citations

Include your literature.bib file in the YAML header (YAML key: bibliography:). Cite any entry recorded in the .bib file with @palmerdata.2020 for in-text citations and [@palmerdata.2020, p. 10] for parenthetical citations.

CSL (Citation Styles)

If the document specifies a csl style, Pandoc will convert Markdown references, i.e., @palmerdata.2020, to ‘hardcoded’ text and a hyperlink to the reference section in your document.

Biblatex / Natbib

If your document specifies a citation package like biblatex or natbib along with the related options, Pandoc will create the corresponding LaTeX commands (e.g. \autocite or \citep) from your Markdown references. This is not recommended because it ties you to LaTeX-based output formats.

Cross-references

You can cross-reference sections, figures, tables or equations (e.g., \@ref(fig:elephant)).

Bookdown output formats

Cross-referencing is possible in all output formats that are part of the bookdown package (e.g. bookdown::pdf_document2). You can reference a figure by \@ref(fig:elephant) where ‘elephant’ is the name of the code chunk that produces the figure.

Quarto

Quarto uses a slightly different syntax: @fig-elephant

If you specify the colorlinks: true option in the YAML header, the hyperlinks to the respective figure will be colored.

Section labels

If you do not specify a section label, Pandoc will automatically assign a label based on the title of your header. For more details, see the Pandoc manual. If you wish to add a manual label to a header, add {#mylabel} to the end of the section header.

Bibliographic reference list

Simple bibliography list

### References

::: {#refs}
:::

Multiple bibliographies

  • use case: separate bibliographies for the main body of your article and the appendix
  • add the section-refs Lua filter

Part IV: Automatable reports for advanced users

Single-source publishing

  • one document that contains narrative text and code can be rendered to several output formats
  • e.g., blog posts, scientific manuscripts
  • ✌️ benefits? ✌️
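
Single-source publishing can be sketched with the Quarto command line. The file name report.qmd is hypothetical, the document deliberately contains no code chunks so that rendering does not depend on R or Python, and the rendering step only runs if quarto is installed.

```shell
# Write a minimal Quarto source file (hypothetical name; no code chunks).
printf '%s\n' '---' 'title: "One source, several outputs"' '---' '' \
  'Some narrative text.' > report.qmd

# Render the same source to two output formats, if quarto is available.
if command -v quarto >/dev/null 2>&1; then
  quarto render report.qmd --to html   # writes report.html
  quarto render report.qmd --to docx   # writes report.docx
else
  echo "quarto not found; skipping rendering"
fi
```

The source file stays the same for both calls; only the --to target changes.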

Conditional code evaluation

Advantages? 🤔

Approach 1

  • add an option to a code chunk, e.g., eval=knitr::is_html_output()
  • powerful in conjunction with “Not in format” Lua filter ::: {.not-in-format .latex}
  • allows you to exclude entire sections (including all relevant code chunks and text)

Approach 2

  • specify the execute YAML option in a .qmd document
  • global execute options (no indentation, for any type of format)
  • format specific option (indentation, specific for each format)
---
format: 
  html:
    toc: true
    code-fold: true
    execute: 
      echo: true
  pdf: 
    toc: false
    execute: 
      echo: false
execute: 
    warning: false
    message: false
---

Code chunks: reference labels

Reference labels of code chunks

The code chunk option ref.label takes a vector of chunk labels to retrieve the content of the respective chunks.

Use case: adding all code to the Appendix of a manuscript

ref.label can also evaluate R code, e.g. to retrieve the code of all labels within a document (knitr::all_labels()).

# Appendix: All code for this presentation
```{r, ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE}
```

…or a subset of chunks that are also evaluated when rendering the document:

labs <- knitr::all_labels(eval == TRUE)
```{r, ref.label=labs}
```

Quarto for Literate Programming

Language interoperability

  • integration of narrative text and code in multiple programming languages (e.g., R, Python, Julia)
  • in addition to knitr engine, jupyter can be used
---
title: "My Document"
jupyter: python3
---

Global code chunk options within the YAML header (execute)

execute key can replace a chunk that globally sets knitr options (as in the RMarkdown framework)

---
title: "All code chunks in this document are not printed by default"
execute:
  echo: false
---

Quarto for scholarly writing

Quarto offers more control regarding the inclusion of author-related meta-data (names, affiliations, contributions to the work) that is printed as part of the title, in some output formats. See the full documentation

author:
  - name:
      given: Norah
      family: Jones
      literal: Norah Jones
    attributes:
      corresponding: true

Custom appendices

You can also include the attribute {.appendix} after any header (at any point of your document) to delegate the respective section to the appendix.

Quarto and figures

Cross-referencing figures

⚠️ Quarto uses a slightly different syntax to cross-reference figures: @fig-elephant

Side-by-Side figures

Quarto allows you to add a dedicated code chunk option #| layout-ncol: 2 to your code chunks to include several figures side by side.

This is very powerful in conjunction with #| fig-subcap: which allows you to specify a caption for each figure.

#| label: fig-graphsidebyside
#| fig-subcap: ["Caption of left figure","Caption of right figure"]

Bonus exercises

Lab Session II 👩‍💻 👨‍💻

Create a reproducible document that…

  • adds a bib-file and cites some work out of it
  • features 1.5 line spacing in its PDF version
  • includes cross-references to a regression table (hint: try to work with modelsummary)

🎁 Bonus

Integrate two tables side-by-side, each with its own sub-caption (hint: Quarto makes it quite easy to solve this task)
